Method and system for speech recognition

ABSTRACT

A description is given of a speech recognition system in which a speech signal of a user is analyzed so as to recognize speech information contained in the speech signal. In a test procedure the recognition result with the most probable match is converted into a speech signal again so as to be output to the user for verification and/or correction. During the analysis there is generated a number of alternative recognition results which match the speech signal to be recognized with the next-highest probabilities. The output within the test procedure is performed in such a manner that, in the case of output of an incorrect recognition result, the user can interrupt the output. In that case respective corresponding segments of the alternative recognition results are output automatically for a segment of the relevant recognition result which has been output last before an interruption, so that the user can make a selection therefrom. The relevant segment in the supplied recognition result is subsequently corrected on the basis of the corresponding segment of a selected alternative recognition result. Finally, the test procedure is continued for the remaining, subsequent segments of the speech signal to be recognized. A corresponding speech recognition system is also described.

The invention relates to a method for speech recognition in which a speech signal of a user is analyzed so as to recognize speech information contained in the speech signal and a recognition result with a most probable match is converted into a speech signal again within a test procedure and output to the user for verification and/or correction. The invention also relates to a speech recognition system which includes a device for the detection of a speech signal of a user, a speech recognition device for analyzing the detected speech signal in order to recognize speech information contained in the speech signal and to determine a recognition result with a most probable match, as well as a speech output device for converting the most probable recognition result into speech information again within a test procedure and for outputting it to the user for verification and/or correction.

Speech recognition systems usually operate in such a manner that first the speech signal is spectrally or temporally analyzed and the analyzed speech signal is subsequently compared in segments with different models of feasible signal sequences with known speech information. To this end, the speech recognition device usually comprises a complete library of different feasible signal sequences, for example, the words that make sense in a language. The model which best matches a given segment of the speech signal is searched for each time by comparing the received speech signal with the available models so as to obtain a recognition result. Customarily, the probability of belonging to the relevant associated segment of the speech signal is calculated for each model. Insofar as the speech signal concerns long texts, for example, one or more sentences, grammatical and/or linguistic rules are also taken into account during the analysis and the calculation of the probability of how well the individual models match the relevant segments of a speech signal. It is thus ensured not only that the individual segments of the long speech signals suitably match the relevant models available, but also that the context in which the speech signal segments occur is taken into account in order to obtain a more sensible overall recognition result, thus reducing the error rate. However, there is still a residual probability that some sentences, parts of a sentence or words of a spoken text are nevertheless incorrectly understood.

Therefore, for most applications it is necessary that a user of the speech recognition system is given the opportunity to test the recognition result and to correct it if necessary. This is necessary in particular in cases where the relevant user does not obtain direct feedback regarding an entry, for example, in applications where the user speaks a long text which is subsequently stored in the form of written text or in another machine-readable form (referred to hereinafter as text form for brevity). Typical examples in this respect are dictation systems or applications in which messages are first converted into a text form which is subsequently processed or propagated via a communication network, for example, as an e-mail, as a fax or as an SMS. A further application of this kind concerns an automatic translation system in which a speech signal is first converted into the text form, after which a translation into a different language is made on the basis of this text form and finally the translated text is converted into a speech signal again so as to be output by means of a speech output device. In conventional dictation systems linked to PCs the recognition result can be displayed directly in a text form on a display screen of the PC so that the user can correct the text by means of the conventional editing functions. This correction method, however, is not suitable for applications which do not offer a possibility for visual display of the recognized text, for example, when devices without a suitable display device are used, such as “normal” telephones, or for applications for partially sighted persons. In such cases it is necessary to output the relevant recognition result to the user in the form of speech via an automatic speech output device, for example, a text-to-speech generator, in such a manner that the user has the possibility of confirming or correcting the recognition result.

A method of this kind is described, for example, in U.S. Pat. No. 6,219,628 B1. The cited document mentions several possibilities for correction. According to one possibility the entire recognized message is reproduced for the user and the user speaks the message once more if the recognition result does not correspond to the actually spoken message. This method is not very satisfactory, notably not in circumstances where the recognition error rate is comparatively high, for example, when a text is spoken in the presence of substantial noise, because the user may then have to speak the complete message a number of times so as to ultimately obtain the desired result. According to a second version respective certainty factors are determined automatically for given segments of the speech signal during the analysis of the speech signal. Subsequently, only those segments of the text which have a low certainty factor are output again to the user, that is, segments for which the probability that an error has occurred is highest. However, the text cannot be completely checked in this manner. According to a third version it is arranged to reproduce the text in segments, for example, in words or in sentences, and to insert a waiting interval at the end of each segment; the user then has the opportunity to individually confirm or reject every individual segment, for example, by way of the word “yes” or “no”. If the user remains silent for a prolonged period of time during the pause, this silence is interpreted as a confirmation. Insofar as the user rejects a reproduced segment, the user has the opportunity to speak this complete segment once more.

Granted, this third version already saves the user a substantial amount of time and is more comfortable than the first version where the complete text must be spoken again. However, it still has the drawback that the user may have to speak the segment to be corrected a number of times, in particular in the case of difficult recognition circumstances in which a high error rate occurs. This method involves a further problem when, for example, in the case of a particularly exceptional pronunciation of a part of the text by the user (for example, because of the user's dialect) the speech recognition system does not have the optimum models available so that, even when the text is spoken several times, it produces an incorrect recognition result as the most probable recognition result.

It is an object of the present invention to improve a method for speech recognition and a system for speech recognition of the kind set forth in such a manner that the correction of an incorrectly understood speech signal can be performed in a faster and simpler manner which is also more comfortable for the user.

This object is achieved in that a number of alternative recognition results, that is, at least one alternative, is generated directly during the analysis, said alternatives matching the speech signal to be recognized with the next-highest probabilities. The output during the test procedure then takes place in such a manner that the user can interrupt the output in the case of incorrectness of the supplied recognition result. For a segment of the relevant recognition result which has been output last before an interruption the corresponding segments of the alternative recognition results are then automatically output, again in the form of speech, for selection by the user. Subsequently, the relevant segment in the supplied recognition result is corrected on the basis of the segment of one of the alternative recognition results selected by the user. Finally, the test procedure is continued for the remaining, subsequent segments of the speech signal to be recognized.

This method utilizes the fact that the speech recognition device already has to test a plurality of alternative recognition results in respect of their probability anyway so as to determine the most probable recognition result. Instead of rejecting the less probable results again during the analysis, the speech recognition device now generates the n-best sentences or word hypothesis graphs as alternative recognition results and stores these alternatives, for example, in a buffer memory for the further test procedure. The amount of additional work to be done by the speech recognition device is only very small. During the test procedure this additional information can be used to offer the relevant user alternatives for the incorrectly recognized segment of the recognition result. Because the probabilities of the various alternatives differ only slightly in many cases, there is often a comparatively high probability that the user will find the correct recognition result among the alternatives. The user can then simply select this correct alternative, without having to speak the relevant text segment again. This eliminates the risk that the text segment which has been spoken again for the correction is incorrectly recognized once again.
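The retention of the less probable results can be illustrated by a minimal sketch. It assumes that the recognizer already yields scored sentence hypotheses; the names `Hypothesis`, `split_n_best` and `n_alternatives` are hypothetical and do not appear in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    segments: tuple      # e.g. the words of one candidate sentence
    probability: float   # match score assigned by the recognizer (assumed given)


def split_n_best(hypotheses, n_alternatives=3):
    """Return the most probable hypothesis and the next-best alternatives.

    The alternatives are kept (for example in a buffer) instead of being
    discarded, so that they can be offered to the user during the test procedure.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    ranked = sorted(hypotheses, key=lambda h: h.probability, reverse=True)
    return ranked[0], ranked[1:1 + n_alternatives]
```

With three candidates scored 0.5, 0.3 and 0.2 and `n_alternatives=1`, the call returns the 0.5 candidate as the result to be read out and a one-element list holding the 0.3 candidate.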

The output of the recognition result during the test procedure can take place in such a manner that a short pause is inserted each time after given segments and that in these pauses it is checked whether the user rejects the last segment of the recognition result, for example, by way of the words “stop” or “no”. Preferably, however, the voice activity of the user is permanently monitored during the output of the recognition result. As soon as the user makes a comment during the output, the output is interrupted. This means that a so-called “barge-in” method is used. Unnecessary pauses can thus be dispensed with during the output, so that the test procedure can be completed very quickly.

In order to avoid that the speech output of the recognition result is aborted in cases where the user makes an utterance during the speech output which interrupts the output even though it was not meant to do so, for example, because it was intended for other persons present in the room, it is arranged that the user can immediately continue the output by speaking a given command such as, for example, “continue”, without having to listen to the various alternative recognition results first.

In conformity with a very advantageous version a request signal is output to the user if the user does not select any segment of the alternative recognition results because, for example, all recognition results were incorrect, thus requesting the user to speak the relevant segment again for correction.

There are various possibilities for the selection of the supplied alternative recognition results.

According to a first version the alternative recognition results are output one after the other, and after each output the system waits to see whether the user confirms the recognition result. In the case of a confirmation, the alternative recognition result is accepted as being correct. Otherwise the next alternative recognition result is output.

According to a second version, all alternative recognition results, or the relevant segments of the alternative recognition results, are continuously output in succession and the user subsequently selects the appropriate recognition result. Preferably, each alternative recognition result is then output together with an indicator, for example, a digit or a letter, which is associated with the relevant recognition result. The user can then select the relevant segment of the various alternative recognition results by inputting the indicator, simply by speaking, for example, the relevant digit or letter.
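The association of indicators with alternatives can be sketched as follows. This is an illustration only: the names `KEYPAD_ORDER`, `label_alternatives`, `build_prompts` and `select` are hypothetical, the digit order is an assumption, and the removal of duplicate segments is a simplification that the description does not prescribe.

```python
KEYPAD_ORDER = "1234567890"  # assumed labelling order: digits 1 to 9, then 0


def label_alternatives(segments):
    """Map indicator digits to the distinct alternative segments, in ranked order."""
    distinct = list(dict.fromkeys(segments))[:len(KEYPAD_ORDER)]
    return dict(zip(KEYPAD_ORDER, distinct))


def build_prompts(labelled):
    """Phrases to be handed to a speech output device, one per alternative."""
    return [f"{digit}: {segment}" for digit, segment in labelled.items()]


def select(labelled, indicator):
    """Return the segment chosen by its indicator, or None if it is unknown."""
    return labelled.get(indicator)
```

Whether the indicator arrives as a recognized spoken digit or as a key signal makes no difference to `select`; only the way the indicator string is obtained changes.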

In a further preferred version a key signal of a communication terminal, for example, a DTMF signal of a telephone set, is associated with the indicator. The selection of one of the segments is then performed by actuating the relevant key of the communication terminal. This offers the advantage that the selection of the recognition result takes place without using an intermediate further speech recognition operation, so that any errors introduced thereby are precluded.

Alternatively, a barge-in method can also be used for the output of the alternative recognition results. This means that in that case the segments of the alternative recognition results are output without a pause and the user simply says “stop” or “yes” or the like when the correct recognition result is output.

In a very advantageous version, after a correction of a segment the various recognition results are evaluated again in respect of their probability of matching the relevant speech signal to be recognized, that is, while taking into account the corrected segment as well as all previously confirmed or corrected segments. The test procedure is then continued by outputting the subsequent segment of the recognition result which has the highest probability after the re-evaluation. As a result of the re-evaluation on the basis of all previously corrected or confirmed parts of the speech signal to be recognized, the recognition result can be continuously improved in a context-dependent probability analysis in the course of the test procedure, thus reducing the probability of corrections being necessary in subsequent sections.

When long texts or messages are to be recognized, various possibilities are available for carrying out the test procedure.

According to one version, the test procedure is carried out only after input of a complete text by the user. The fact that the desired text has been spoken completely can be signaled, for example, by the user by means of an appropriate command such as “end” or the like.

According to a further version, the test procedure is carried out already after the input of a part of a complete text. This offers the advantage that already verified or corrected parts of the text can possibly be further processed in other components of the application or stored in a memory, without the speech recognition system still being burdened thereby. For example, a test procedure can be carried out for a previously input part of a text whenever a given length of the part of the text or speech signal is reached and/or when a speech pause of given duration occurs and/or when the user specifies this by means of a special command.
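The three trigger conditions named above can be combined in a simple predicate. The following sketch is hypothetical: the function name, the thresholds of 20 words and 2.0 seconds, and the command words are assumptions chosen for illustration and are not specified in the description.

```python
def should_start_test(word_count, pause_seconds, command=None,
                      max_words=20, max_pause_seconds=2.0,
                      check_commands=("check", "end")):
    """Decide whether a test procedure should start for the text input so far.

    All threshold values and command words are illustrative assumptions.
    """
    if word_count == 0:
        return False  # nothing has been recognized yet, so nothing to verify
    return (word_count >= max_words                 # given length reached
            or pause_seconds >= max_pause_seconds   # speech pause of given duration
            or command in check_commands)           # explicit user command
```

The conditions are joined with a logical OR, which corresponds to the “and/or” wording; a stricter system could require several of them at once.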

A speech recognition system in accordance with the invention must include a speech recognition device for the execution of the method in accordance with the invention which is constructed in such a manner that during the analysis it generates a number of alternative recognition results and outputs or stores such results which, in relation to the most probable matching recognition result that is output anyway, match the speech signal to be recognized with the next-highest probabilities. Moreover, the speech recognition system requires means for interruption of the output within the test procedure by the user as well as a dialog control device which automatically outputs the corresponding segments of the alternative recognition results for a segment of the relevant recognition result last output before an interruption. Furthermore, the speech recognition system should include means for selecting one of the supplied segments of the alternative recognition results as well as a correction device for correcting the relevant segment in the recognition result output first on the basis of the corresponding segment of the selected alternative recognition result.

Insofar as the selection of the alternative recognition result should take place by means of a key signal of a communication terminal, the speech recognition system should also include an appropriate interface for receiving such a key signal, for recognizing it and for using it to select one of the supplied segments.

The speech recognition system in accordance with the invention can advantageously be realized essentially by means of suitable software on a computer or in a speech control of an apparatus. For example, the speech recognition device and the dialog control device can be realized completely in the form of software modules. A device for generating speech on the basis of computer-readable texts, for example, a so-called TTS converter (Text-To-Speech converter), can also be realized by means of appropriate software. It is merely necessary for the system to comprise a facility for speech input, for example, a microphone with a suitable amplifier, and for speech output, for example, a loudspeaker with a suitable amplifier.

The speech recognition system may then be present in a server which can be reached via a customary communication network, for example, a telephone network or the Internet. In this case it suffices when the speech input device and the speech output device, that is, the microphone, the loudspeaker and relevant amplifiers, are present in a communication terminal of the user which is connected to the server of the speech recognition system via the relevant network. Furthermore, the speech recognition system need not be realized within a single apparatus, for example, on a single server. Various components of the system may instead be situated in different locations which are interconnected via a suitable network. The speech recognition system in accordance with the invention may be associated with a very specific application, for example, an application which converts voicemail messages within a communication system into SMS messages or e-mails. However, the speech recognition system may alternatively be available as a service system for a plurality of different applications, thus forming an interface to the users of each of these applications.

The invention will be described in detail hereinafter on the basis of an embodiment as shown in the accompanying drawings. Therein:

FIG. 1 is a diagrammatic block diagram of a speech recognition system in accordance with the invention, and

FIG. 2 shows a flow chart illustrating the correction method.

The embodiment of a speech recognition system 1 as shown in FIG. 1 comprises an input 14 whereto a microphone 2 is connected via an amplifier 3. The speech recognition system 1 also includes an output 16 whereto a loudspeaker 4 is connected, via an amplifier 5, in order to output speech signals. The microphone 2 and the associated amplifier 3 and the loudspeaker 4 and the associated amplifier 5 form part of an apparatus which is remote from the speech recognition system 1 and which communicates with the speech recognition system 1 via a communication network, for example, a telephone network.

The communication terminal also includes a keyboard 6 via which acoustic signals, for example, DTMF (Dual Tone Multi Frequency) signals can be generated; these signals are also applied to the input 14 of the speech recognition system via the speech signal channel.

Speech signals S_(I) arriving at the input 14 from the microphone 2, via the amplifier 3, are converted into a readable or machine-readable text by the speech recognition system 1 and conducted to an application 15, for example, for the transmission of SMS messages or e-mail; this application subsequently processes and/or transmits said text data accordingly.

To this end, at the input side the acoustic signal first reaches a so-called Voice Activity Detector (VAD) 12 which tests the incoming signal only as to whether there is actually an incoming speech signal S_(I) from a user or whether the signal concerns only background noise etc. The speech signal S_(I) is then applied to a speech recognition device 7 which analyzes the speech signal S_(I) in a customary manner in order to recognize speech information contained therein and which determines a recognition result with a most probable match.

In conformity with the invention the speech recognition device 7 is arranged in such a manner that in addition to the recognition result which matches the speech signal S_(I) to be recognized with the highest probability, there is also generated a number of alternative recognition results which match the speech signal S_(I) to be recognized with the next-highest probabilities.

The recognition result which matches the speech signal S_(I) to be recognized with the highest probability is then applied in text form to a dialog control device 10 which conducts this most probable recognition result to a text-to-speech generator (TTS generator) 9. The alternative recognition results can also be applied directly to the dialog control device 10 in which they are buffered, or can be stored in a separate memory 8 by the speech recognition device 7, which separate memory can be accessed at all times by the dialog control device 10. Using the TTS generator 9, the most probable recognition result is then converted into a speech signal and output in the form of speech, via the amplifier 5 and the loudspeaker 4, within a test procedure for the verification and/or correction by the user.

The exact execution of this test procedure will be described in detail hereinafter with reference to FIG. 2.

In the step I the method commences with the previously described speech input. Subsequently, in the step II of the method the various alternative recognition results are determined and ultimately evaluated in the step III of the method in order to determine which recognition result best matches the speech signal S_(I) to be recognized. Subsequently, in the step IV of the method the most probable recognition result is output in segments, said output in segments taking place continuously so that the individual segments per se cannot be recognized by the user. The individual segments may be, for example, the individual words of a sentence or a word hypothesis graph or also parts of a sentence or parts of a word hypothesis graph.

After each segment it is tested in the step V of the method whether the output is interrupted by the user. This is possible, for example, when the user speaks during the output of the recognition result. The voice activity of the user is immediately recognized by the VAD 12 which stops, via a corresponding control signal S_(C), the TTS generator 9 and at the same time applies the control signal S_(C) also to the dialog control device 10 so that the latter also registers the interruption of the output by the user. If no interruption takes place, it is tested whether the end of the input text has been reached (step VI of the method). If this is the case, the recognition result is deemed to have been verified by the user and the recognition result is applied to the application 15 (step VII of the method). If the end of the text has not yet been reached, the output of the most probable recognition result is continued.

However, if an interruption is registered in the step V of the method, in the step VIII of the method it is first determined which incorrect segment is concerned. For the sake of simplicity it is assumed herein that this is the segment which has been output last, that is, the segment output directly before the output was interrupted by the user.

The dialog control device 10 then accesses the buffer 8, insofar as the alternative recognition results were not stored within the dialog control device 10 itself, and determines the segments of the alternative recognition results which correspond to the incorrect segment determined in the step VIII of the method. The corresponding segments, or the alternative recognition results, are then associated with indicators, for example, the digits 1 to 0.

Via the TTS generator 9, the alternative segments then available are output, each time together with the associated indicators, in the form of speech to the user (step IX of the method).

In the step X of the method, finally, the user can select a suitable segment from the alternative recognition results by depressing a key, corresponding to the indicator, on a keyboard 6. Pressing this key generates a DTMF signal which is conducted, via the speech channel, to the input 14 of the speech recognition system 1. This DTMF signal is then recognized by a DTMF recognizer 13 which is connected parallel to the speech recognition device 7. The DTMF recognizer 13 outputs a corresponding selection signal S_(A) to the dialog control device 10, which signal triggers a correction unit 11 to replace the incorrectly recognized segment by the relevant segment of the selected alternative recognition result (step XI of the method). The DTMF recognizer 13 can also apply a signal to the speech recognition device 7 upon recognition of a DTMF signal, so that the speech recognition device 7, for example, is deactivated and does not unnecessarily attempt to analyze the DTMF signal.
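The steps IV to XI can be summarized in a short control-flow sketch. It is a simplification under stated assumptions: the alternatives are assumed to be segmented in the same way as the most probable result, so that segments correspond by index, and the callbacks `interrupted` and `choose` are hypothetical stand-ins for the voice activity detection and the key or speech selection.

```python
def run_test_procedure(best, alternatives, interrupted, choose):
    """Walk through the segments of the most probable result and correct them.

    best:         list of segments of the most probable recognition result
    alternatives: list of segment lists, assumed aligned index by index with best
    interrupted:  callable(index, segment) -> bool, stands in for barge-in detection
    choose:       callable(options) -> index into options, or None if none fits
    """
    verified = []
    for i, segment in enumerate(best):            # step IV: output next segment
        if interrupted(i, segment):               # step V: user interrupts output
            # steps VIII and IX: corresponding segments of the alternatives
            options = [alt[i] for alt in alternatives if i < len(alt)]
            picked = choose(options)              # step X: selection by the user
            if picked is not None:
                segment = options[picked]         # step XI: replace the segment
        verified.append(segment)
    return verified                               # step VII: hand over the result
```

If the user selects none of the options, this sketch leaves the segment unchanged; a fuller system would at that point ask the user to speak the segment again. Re-evaluation of the remaining segments after a correction is deliberately left out here.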

After successful correction, a re-evaluation of all recognition results is carried out in the step XII of the method, that is, of the most probable recognition result and the alternative recognition results. Preferably, this re-evaluation is performed in the speech recognition device 7 which is also capable of accessing the buffer 8 or which receives the data required for this purpose from the dialog control device 10. This context-dependent re-evaluation of the recognition results takes into account all previously verified or corrected segments, meaning that the probability of each of these segments is taken to be 100%, whereas the probability of all alternatives to these segments is taken to be 0%. It can thus be achieved, for example, that on the basis of the already known segments those hypotheses which, without this prior knowledge, have a high probability are rejected while other hypotheses which originally have a low probability now become very probable. As a result, the error rate in the output of the subsequent segments is significantly reduced and hence the overall correction method is accelerated. Additionally or alternatively the reliably recognized parts of the utterance of the user can also be taken into account for an adaptation of the language models and/or the acoustic models.
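One simple way to picture the effect of setting confirmed segments to 100% and their alternatives to 0% is to discard every hypothesis that contradicts a confirmed segment and to renormalize the remaining scores. The following sketch does only that; it is not the context-dependent rescoring of the speech recognition device itself, and the function name and the data layout are assumptions.

```python
def reevaluate(hypotheses, confirmed):
    """Keep only hypotheses consistent with the confirmed segments and renormalize.

    hypotheses: list of (segments, probability) pairs, segments being a tuple
    confirmed:  dict mapping segment index -> verified or corrected segment
    Returns the surviving hypotheses sorted by their rescaled probability.
    """
    consistent = [
        (segments, p) for segments, p in hypotheses
        if all(i < len(segments) and segments[i] == s for i, s in confirmed.items())
    ]
    total = sum(p for _, p in consistent)
    if total == 0:
        return []  # no stored hypothesis agrees with what the user confirmed
    rescored = [(segments, p / total) for segments, p in consistent]
    return sorted(rescored, key=lambda h: h[1], reverse=True)
```

In this sketch a hypothesis that originally ranked second can become the most probable one as soon as the first-ranked hypothesis contradicts a confirmed segment, which is the effect described above.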

It is to be noted again that the described speech recognition system and the execution of the method concern only a special embodiment of the invention and that a person skilled in the art will be capable of modifying the speech recognition system and the method in various ways. For example, it is notably possible and also sensible to insert in the method a step in which the user has the opportunity, insofar as none of the segments of the alternative recognition results is deemed to be correct, to speak the segment again. It is also possible that instead of the selection by means of a DTMF-capable keyboard 6, the selection is performed by means of speech input or that the keyboard transmits other signals which are applied, via a separate data channel, to the speech recognition system 1 which can then process the signals accordingly. Similarly, the interruption of the speech output within the test procedure may also take place by means of a specific DTMF signal or the like.

1. A method for speech recognition in which a speech signal of a user is analyzed so as to recognize speech information contained in the speech signal and a recognition result with a most probable match is converted into a speech signal again within a test procedure and output to the user for verification and/or correction, characterized in that during the analysis a number of alternative recognition results is generated, said alternative recognition results matching the speech signal to be recognized with the next-highest probabilities, and that the output takes place within the test procedure in such a manner that the user can interrupt the output in the case of incorrectness of the supplied recognition result and that for a segment of the relevant recognition result which has been output last before an interruption the corresponding segments of the alternative recognition results are automatically output for selection by the user, and that finally the relevant segment in the supplied recognition result is corrected on the basis of the corresponding segment of a selected alternative recognition result, after which the test procedure is continued for remaining, subsequent segments of the speech signal to be recognized.

2. A method as claimed in claim 1, characterized in that the voice activity of the user is permanently monitored during the output of the recognition result within the test procedure and that the output is interrupted in response to the reception of a speech signal of the user.

3. A method as claimed in claim 1, characterized in that if no segment of the alternative recognition results is selected, a request signal is output requesting the user to speak the relevant segment again for correction.

4. A method as claimed in claim 1, characterized in that with each alternative recognition result there is associated an indicator and that during the test procedure the relevant segments of the alternative recognition results are output each time together with the associated indicator and the selection of a segment of an alternative recognition result takes place by inputting the indicator.

5. A method as claimed in claim 4, characterized in that the indicator is a digit or a letter.

6. A method as claimed in claim 4, characterized in that with the indicator there is associated a key signal of a communication terminal and that the selection of a segment of an alternative recognition result takes place by actuation of the relevant key of the communication terminal.

7. A method as claimed in claim 1, characterized in that, after a correction of a segment output within the test procedure, the various recognition results are re-evaluated in respect of their probability of matching the relevant speech signal to be recognized, that is, while taking into account the segment corrected last and/or the already previously confirmed or corrected segments, the test procedure being continued with the output of the next segment of the recognition result which exhibits the highest probability after the re-evaluation.

8. A method as claimed in claim 1, characterized in that the test procedure takes place only after termination of the input of a complete text by the user.

9. A method as claimed in claim 1, characterized in that the test procedure takes place already after the input of a part of a complete text by the user.

10. A speech recognition system (1) which comprises: a device (2) for detecting a speech signal of a user, a speech recognition device (7) for analyzing the detected speech signal (S_(I)) for the recognition of speech information contained in the speech signal (S_(I)) and for determining a recognition result with a most probable match, and a speech output device (9) for converting the most probable recognition result into speech information again within a test procedure and to output it to the user for verification and/or correction, characterized in that the speech recognition device (7) is constructed in such a manner that during the analysis it generates a number of alternative recognition results which match the speech signal (S_(I)) to be recognized with the next-highest probabilities, and that the speech recognition system (1) comprises: means (12) for interrupting the output during the test procedure by the user, a dialog control device (10) which automatically outputs respective corresponding segments of the alternative recognition results for a segment of the relevant recognition result output last before an interruption, means (6, 13) for selecting one of the supplied segments of the alternative recognition results, and a correction unit (11) for the correction of the relevant segment in the recognition result output next on the basis of the corresponding segment of a selected alternative recognition result.

11. A computer program product which comprises program code means for executing all steps of a method as claimed in claim 1 when the program is run on a computer.